- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 867
Add support for manual checkpoints #1709
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 🦋 Changeset detectedLatest commit: cea6cf4 The changes in this PR will be included in the next version bump. This PR includes changesets to release 12 packages
 Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR | 
| Caution Review failedThe pull request is closed. WalkthroughThe pull request implements several updates across multiple components. In the Docker setup, it adds  Changes
 Sequence Diagram(s)sequenceDiagram
    participant C as Client
    participant I as Internal HTTP Server
    participant T as TaskCoordinator
    participant CP as Checkpointer
    C->>I: Send checkpoint request (/checkpoint/duration or /checkpoint/manual)
    I->>T: Forward request for processing
    T->>CP: Initiate checkpoint process
    CP-->>T: Return checkpoint result
    T-->>I: Send response details
    I-->>C: Deliver HTTP response
sequenceDiagram
    participant Req as Restore Request
    participant E as CreateCheckpointRestoreEventService
    participant H as Manual Restore Handler
    Req->>E: Send restore checkpoint request (type "MANUAL")
    E->>H: Invoke #handleManualCheckpointRestore
    H->>E: Process metadata and update statuses
    E-->>Req: Return restoration outcome
Poem
 📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
 Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit: 
 Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
 Other keywords and placeholders
 CodeRabbit Configuration File ( | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (5)
apps/coordinator/src/index.ts (1)
1392-1621: Consolidate duplicated logic in/checkpoint/durationand/checkpoint/manual.The new internal endpoints are well-structured and straightforward. However, they both perform similar checkpoint creation logic. Consider extracting the checkpoint creation path and repeated reply steps into a shared helper to reduce duplication, especially if more checkpoint endpoints are added in the future.
Also, you’re using
console.logat lines 1446 and 1547. For consistency with the rest of the coordinator, consider switching to the structured logger (e.g.,logger.log) for more uniform and searchable logs.apps/coordinator/src/util.ts (1)
21-31: Graceful fallback with null return.Returning
nullon syntax errors avoids runtime crashes. However, clarifying the possible return values (undefined,null, or parsed object) in your JSDoc or type annotations could help avoid confusion and reduce potential type errors in the rest of the codebase.apps/webapp/app/database-types.ts (2)
73-75: LGTM! Consider cachingObject.values()result.The type predicate implementation is correct and enhances type safety. For better performance, consider caching the result of
Object.values()since the status objects are constant.+const taskRunAttemptStatusValues = Object.values(TaskRunAttemptStatus); export function isTaskRunAttemptStatus(value: string): value is keyof typeof TaskRunAttemptStatus { - return Object.values(TaskRunAttemptStatus).includes(value as keyof typeof TaskRunAttemptStatus); + return taskRunAttemptStatusValues.includes(value as keyof typeof TaskRunAttemptStatus); }
77-79: LGTM! Consider cachingObject.values()result.Similar to the previous function, consider caching the result of
Object.values()for better performance.+const taskRunStatusValues = Object.values(TaskRunStatus); export function isTaskRunStatus(value: string): value is keyof typeof TaskRunStatus { - return Object.values(TaskRunStatus).includes(value as keyof typeof TaskRunStatus); + return taskRunStatusValues.includes(value as keyof typeof TaskRunStatus); }apps/webapp/app/v3/services/createCheckpoint.server.ts (1)
154-164: Consider adding validation forrestoreAtUnixTimeMs.The implementation allows for immediate restoration if
restoreAtUnixTimeMsis not specified. However, consider adding validation to ensure the timestamp is not in the past.if (reason.type === "MANUAL") { // Restore immediately if not specified, useful for live migration - restoreAtUnixTimeMs = reason.restoreAtUnixTimeMs ?? Date.now(); + const now = Date.now(); + restoreAtUnixTimeMs = reason.restoreAtUnixTimeMs ?? now; + if (restoreAtUnixTimeMs < now) { + logger.warn("Manual checkpoint restore time is in the past, using current time", { + requestedTime: restoreAtUnixTimeMs, + currentTime: now, + }); + restoreAtUnixTimeMs = now; + } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- apps/coordinator/Containerfile(1 hunks)
- apps/coordinator/src/checkpointer.ts(3 hunks)
- apps/coordinator/src/index.ts(6 hunks)
- apps/coordinator/src/util.ts(1 hunks)
- apps/webapp/app/database-types.ts(1 hunks)
- apps/webapp/app/v3/services/createCheckpoint.server.ts(5 hunks)
- apps/webapp/app/v3/services/createCheckpointRestoreEvent.server.ts(3 hunks)
- packages/core/src/v3/schemas/messages.ts(1 hunks)
- packages/core/src/v3/schemas/schemas.ts(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (windows-latest - npm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - pnpm)
- GitHub Check: e2e / 🧪 CLI v3 tests (ubuntu-latest - npm)
- GitHub Check: units / 🧪 Unit Tests
- GitHub Check: typecheck / typecheck
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (15)
apps/webapp/app/v3/services/createCheckpointRestoreEvent.server.ts (3)
8-10: Imports look good.These newly introduced imports integrate well with the existing codebase to support manual checkpoint metadata parsing and status checks.
49-54: Early return condition is clear.This conditional block correctly prevents event creation if the manual checkpoint restoration process fails. It helps avoid inconsistent states in the database.
117-192:❓ Verification inconclusive
Validate the nested update syntax.
The schema for updating nested relations in Prisma (or similar ORM) typically doesn’t require a “data” property within the
updateblock. Verify that:taskRun: { update: { status: previousRunStatus, }, }is not the required syntax, unless your schema explicitly mandates the nested “data” property. Otherwise, this may trigger runtime errors.
🏁 Script executed:
#!/bin/bash # Description: Grep for 'taskRun: { update: { data:' usage to confirm consistency with your Prisma schema. rg -A 5 $'taskRun:\\s*{\\s*update:\\s*{\\s*data:'Length of output: 175
Action Required: Verify Nested Update Syntax for
taskRunThe current code uses the following nested update:
taskRun: { update: { data: { status: previousRunStatus, }, }, },Most Prisma nested writes for one-to-one relations follow the pattern without wrapping the update fields inside a
dataobject, for example:taskRun: { update: { status: previousRunStatus, }, },Please verify that your Prisma schema requires the nested
dataproperty. If not, consider updating the syntax to the canonical form since the extra nesting might trigger runtime errors. Additionally, note that our initial regex search for similar usages encountered errors due to regex syntax. You may want to re-run the following command manually or update the regex to use POSIX classes for a more reliable verification:rg -A 5 'taskRun:[[:space:]]*{[[:space:]]*update:[[:space:]]*{[[:space:]]*data:' apps/webapp/app/v3/services/createCheckpointRestoreEvent.server.tsapps/coordinator/src/index.ts (7)
17-17: Safe JSON parse import is appropriate.Importing
safeJsonParsefrom “./util” is aligned with the new endpoints’ need for robust JSON handling.
45-46: New internal server instance creation.Defining
#internalHttpServerseparately is a neat approach to isolating manual checkpoint endpoints from the publicly exposed HTTP server.
84-84: Initialization of internal server.You consistently start the internal server creation alongside the main server, which helps keep the code structure consistent.
1375-1378: Improved 404 response for the primary HTTP server.Returning a 404 for unknown paths is a standard best practice that prevents information leakage.
1381-1390: Client error handling ensures graceful fallback.Ending the socket connection with a 400 response is a straightforward way to handle malformed requests, preventing potential partial data issues.
1617-1617: Internal server logs are helpful.Logging the internal server’s port on startup is valuable for debugging and monitoring.
1625-1625: Binding internal server to localhost.Listening on
127.0.0.1is a good approach to prevent external access to potentially sensitive endpoints.apps/coordinator/Containerfile (1)
15-17:❓ Verification inconclusive
Verify the need for
busyboxin the coordinator service.The addition of
busyboxsuggests a need for shell utilities. Please clarify the specific use cases that requirebusyboxto ensure it's necessary for the coordinator service.Run the following script to check if
busyboxis used in the codebase:
🏁 Script executed:
#!/bin/bash # Description: Search for potential uses of busybox utilities in the codebase. # Test: Search for shell commands that might use busybox utilities. rg -A 5 'exec[^a-zA-Z]|spawn|sh -c|/bin/|/usr/bin/'Length of output: 67144
Coordinator Container Image: Clarification on BusyBox Requirement
I couldn’t identify any explicit invocation of BusyBox commands within the coordinator service’s code beyond its installation in the Containerfile (lines 15–17). While other services (e.g. in the Kubernetes provider) use BusyBox (e.g. for wget), there’s no direct evidence that the coordinator service leverages its utilities. Could you please clarify whether BusyBox is needed—for example, to supply essential shell utilities on a minimal base—or if its inclusion can be reconsidered? If it’s required indirectly (e.g. by build tools or for consistency), adding a brief code comment would help future maintenance.
apps/webapp/app/v3/services/createCheckpoint.server.ts (1)
104-115: LGTM! Metadata handling for manual checkpoints.The implementation correctly constructs metadata for manual checkpoints using the
satisfiesoperator to ensure type safety.apps/coordinator/src/checkpointer.ts (2)
418-424: LGTM! Well-structured helper function.The
removeCurrentAbortControllerfunction is a good extraction of common logic that safely handles abort controller cleanup by ensuring only the current controller is removed and properly detaching event listeners to prevent memory leaks.
435-435: LGTM! Consistent usage of the helper function.The cleanup function now consistently uses the
removeCurrentAbortControllerhelper in both docker and non-docker cleanup paths, improving code maintainability.Also applies to: 448-448
packages/core/src/v3/schemas/messages.ts (1)
482-486: LGTM! Well-designed schema for manual checkpoints.The new
MANUALcase in theCHECKPOINT_CREATEDmessage schema is well-designed:
- Aligns with the PR objective to add manual checkpoint support
- The optional
restoreAtUnixTimeMsfield provides flexibility for both immediate and delayed restoration- Clear documentation explains the default behavior
| export const ManualCheckpointMetadata = z.object({ | ||
| /** NOT a friendly ID */ | ||
| attemptId: z.string(), | ||
| previousRunStatus: z.string(), | ||
| previousAttemptStatus: z.string(), | ||
| }); | ||
|  | ||
| export type ManualCheckpointMetadata = z.infer<typeof ManualCheckpointMetadata>; | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Enhance schema validation for status fields.
The schema correctly defines the structure for manual checkpoint metadata. However, consider using the existing status enums for better type safety.
 export const ManualCheckpointMetadata = z.object({
   /** NOT a friendly ID */
   attemptId: z.string(),
-  previousRunStatus: z.string(),
-  previousAttemptStatus: z.string(),
+  previousRunStatus: z.enum(Object.values(TaskRunStatus) as [string, ...string[]]),
+  previousAttemptStatus: z.enum(Object.values(TaskRunAttemptStatus) as [string, ...string[]]),
 });📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| export const ManualCheckpointMetadata = z.object({ | |
| /** NOT a friendly ID */ | |
| attemptId: z.string(), | |
| previousRunStatus: z.string(), | |
| previousAttemptStatus: z.string(), | |
| }); | |
| export type ManualCheckpointMetadata = z.infer<typeof ManualCheckpointMetadata>; | |
| export const ManualCheckpointMetadata = z.object({ | |
| /** NOT a friendly ID */ | |
| attemptId: z.string(), | |
| previousRunStatus: z.enum(Object.values(TaskRunStatus) as [string, ...string[]]), | |
| previousAttemptStatus: z.enum(Object.values(TaskRunAttemptStatus) as [string, ...string[]]), | |
| }); | |
| export type ManualCheckpointMetadata = z.infer<typeof ManualCheckpointMetadata>; | 
This also adds a fix for lazy attempt errors that's already live.
Summary by CodeRabbit